Genomic epidemiology of syphilis in England: a population-based study

Summary Background Syphilis is a sexually transmitted bacterial infection caused by Treponema pallidum subspecies pallidum. Since 2012, syphilis rates have risen dramatically in many high-income countries, including England. Although this increase in syphilis prevalence is known to be associated with high-risk sexual activity in gay, bisexual, and other men who have sex with men (GBMSM), cases are rising in heterosexual men and women. The transmission dynamics within and between sexual networks of GBMSM and heterosexual people are not well understood. We aimed to investigate if whole genome sequencing could be used to supplement or enhance epidemiological insights around syphilis transmission. Methods We linked national patient demographic, geospatial, and behavioural metadata to whole T pallidum genome sequences previously generated from patient samples collected from across England between Jan 1, 2012, and Oct 31, 2018, and performed detailed phylogenomic analyses. Findings Of 497 English samples submitted for sequencing, we recovered 240 genomes (198 from the UK Health Security Agency reference laboratory and 42 from other laboratories). Three duplicate samples (same patient and collection date) were included in the main phylogenies, but removed from further analyses of English populations, leaving 237 genomes. 220 (92·8%) of 237 samples were from men, nine (3·8%) were from women, and eight (3·4%) were of unknown gender. Samples were mostly from London (n=118 [49·8%]), followed by southeast England (n=29 [12·2%]), northeast England (n=24 [10·1%]), and southwest England (n=15 [6·3%]). 180 (76·0%) of 237 genomes came from GBMSM, compared with 25 (10·5%) from those identifying as men who have sex with women, 15 (6·3%) from men with unrecorded sexual orientation, nine (3·8%) from those identifying as women who have sex with men, and eight (3·4%) from people of unknown gender and sexual orientation. Phylogenomic analysis and clustering revealed two dominant T pallidum sublineages in England. Sublineage 1 was found throughout England and across all patient groups, whereas sublineage 14 occurred predominantly in GBMSM older than 34 years and was absent from samples sequenced from the north of England. These different spatiotemporal trends, linked to demography or behaviour in the dominant sublineages, suggest they represent different sexual networks. By focusing on different regions of England we were able to distinguish a local heterosexual transmission cluster from a background of transmission in GBMSM. Interpretation These findings show that, despite extremely close genetic relationships between T pallidum genomes globally, genomics can still be used to identify putative transmission clusters for epidemiological follow-up. This could be of value for deconvoluting putative outbreaks and for informing public health interventions. Funding Wellcome funding to the Sanger Institute, UK Research and Innovation, National Institute for Health and Care Research, European and Developing Countries Clinical Trials Partnership, and UK Health Security Agency.

sequencing dataset and national surveillance rates, we also retrieved summary statistics from GUMCAD data for all syphilis patients 16 years and older in England from 2012-2018 (n=50,845).The final linked dataset was limited for patient confidentiality purposes, and included information on the year and month (where available -for a small number of samples where month was unknown, this was set to June) of sample collection, English region ("PHE centre"), patient gender and sexual orientation, patient age group, HIV status (for GBMSM patients only, to prevent deductive disclosure in heterosexuals), and whether the patient was born in the UK.
Patient metadata for samples prospectively collected from the five laboratories in Brighton, Birmingham, Leeds, London and Manchester were retrieved from local laboratory information systems.In most cases, we were able to access details on patient gender, declared sexual orientation, HIV status (for GBMSM patients and a subset of others), age, geographical region, and sample collection year.We also searched for samples linked to the same patient identifiers to enable deduplication across the full dataset.We integrated the patient metadata from these samples into the dataset retrieved by UKHSA from GUMCAD.A full description of sample and data flow is shown in the CONSORT diagram from Supplementary Figure 1.Ethics and Governance Group.Limited patient metadata was retrieved from GUMCAD, by UKHSA, and used to generate summary statistics about populations.Linkage of patient metadata to samples was performed at UKHSA, and was fully pseudonymised before transfer to WSI, where it was held on an encrypted server with access restricted to study investigators.Heterosexual individuals who were also HIV positive were expected to be rare in the dataset, and to minimise the risk of deductive disclosure of patient identity, HIV status was not provided for these individuals.Collection and sharing of residual diagnostic samples and patient metadata prospectively collected by non-referring laboratories were included within the sample ethics described above.Patient metadata for these samples was retrieved from hospital LIMS by the local investigator, pseudonymised using a key held only by the referring laboratories, before transfer to WSI.

Sequence analysis
Whole genome sequencing of all clinical T. pallidum samples used in this study has been previously described 3 , and was performed directly on the residual genomic DNA extracts from residual diagnostic samples using the pooled sequence capture method 1,4 on Illumina HiSeq 4000.For consistency across studies, the genome dataset and multiple sequence alignments from Beale 2021 3 were reused for this study, with the same approaches to site masking, recombination, and phylogenetic reconstruction used.Moreover, the underlying BEAST2 phylogeny was also reused to contexualise UK strains, both using the full time-scaled phylogeny as well as by creating derivative subtrees.Underlying bioinformatic methods describing those used and evaluations performed in Beale 2021 are described below (and specifically where additional methods were used), but detailed evaluation results are fully described in that paper 3 .
Sequencing reads were filtered using the full bacterial and human Kraken 2 5 v2.0.8 database (March 2019) to extract Treponema genus-specific sequencing reads, followed by trimming with Trimmomatic 6 v0.39 and downsampling to a maximum of 2,500,000 using seqtk v1.0 (available at https://github.com/lh3/seqtk)as previously described 1 .Sequencing reads were mapped to a custom version of the SS14_v2 reference genome (NC_021508.1)using bwa mem v0.7.17 as previously described 3 , generating whole genome multiple sequence alignments using samtools 7 v1.6 and bcftools v1.6, and requiring a minimum of two supporting reads per strand and five in total to call a variant, and a variant frequency/mapping quality cut-off of 0.8.Sites not meeting our filtering criteria were masked to 'N' in the final pseudosequence.For 12 repetitive Tpr genes (Tpr A-L), two highly repetitive genes (arp, TPANIC_0470) and five FadL homologues (TPANIC_0548, TPANIC_0856, TPANIC_0858, TPANIC_0859, TPANIC_0865), we masked both the initial reference genome using bedtools 8 v2.29 maskfasta, as well as repeating the masking over the final multiple sequence alignment using 'remove_block_from_aln.py' available at https://github.com/sangerpathogens/remove_blocks_from_aln/ to ensure sites originally masked in the reference were not inadvertently called with SNPs due to mapped reads overlapping the masked region.The pre-masked regions (and alignment positions) are shown in Supplementary Data 3).We excluded genomes with <75% of genomic positions passing filters at >5x coverage (≤25% masked sites).However, most genomes had much higher genome breadth, with 79% of genomes (n=409/520) having <10% of sites masked, and 95% of genomes (492/520) having <20% sites masked.Only 5% of genomes (n=28/520) had 20-25% of sites masked.We have previously demonstrated inclusion of genomes with masking ≤25% does not substantially affect the tree topology when compared to phylogenies constructed from genomes with ≤5% masking, nor does selection of reference genome play a substantial role 3 .

Phylogenomic analysis
As previously described for this genomic dataset, we screened the resulting multiple sequence alignment for recombination using Gubbins 9 v2.4.1 (masked regions shown in Supplementary Data 3), generating whole genome maximum likelihood phylogenies using a K3Pu+F+I model and 10,000 UltraFast bootstraps in IQ-Tree 10 v1.6.10, and inputting SNP-only alignments plus missing constant sites using the `-fconst` flag 3 .
For analysis of UK genomes in global context, we used our previously described dataset of 526 TPA genomes (including the 237 UK samples studied here, as well as 289 non-UK samples from 21 countries, representing all high quality TPA genomes available at the time of study) 3 with 901 variable sites.The UK-specific multiple sequence alignment which was subsampled to construct minimum spanning trees contained 237 genomes and 445 variable sites.
For temporal analysis, we had previously evaluated this dataset using both a Strict Clock model 11 (starting rate prior 1.78 × 10−7) and an Uncorrelated Relaxed Clock model 12 , with HKY substitution model 13 and diffuse gamma distribution prior 14 (shape 0.001, scale 1,000) over 100 million Markov Chain Monte Carlo (MCMC) cycles with 10 million cycle burn-in.We evaluated constant, relaxed lognormal, exponential and Bayesian Skyline (10 categories) population distributions 15 .We used the marginal likelihood estimates from the triplicate BEAST runs as input to Path Sampling and Stepping Stone Sampling analysis 16,17 , and this suggested that the Strict Clock with Bayesian Skyline was the optimal model for this dataset, with an inferred molecular clock rate of 1.23 × 10 −7 substitutions per site per year.
To analyse the full dataset we used a time-scaled Bayesian maximum credibility tree of 520 global genomes (after excluding heavily passaged samples or those with uncertain collection dates, 883 variable sites) previously generated and evaluated using BEAST2 18 v2.6.3, using a Strict Clock with reference rate prior of 1.23 × 10 −7 substitutions per site per year, HKY substitution model 13 , and Coalescent Bayesian Skyline distribution with 10 populations 13,15 over 500 million MCMC cycles in triplicate.All chains converged with effective sample sizes >200, we used logcombiner v2.6.3 with a 10% burn-in to generate consensus log and tree files, resampling 100,000 states for the full 520 sample analysis, and we used treeannotator v1.8.4 to produce a median maximum credibility tree.The temporal signal in this dataset was confirmed using tip date resampling (TIPDATINGBEAST 19 ), implemented through 20 datasets with randomly reassigned dates run under the same conditions over 500 million MCMC cycles.
We performed joint ancestral reconstruction 25 on the maximum likelihood phylogeny using pyjar v0.1.0(available at https://github.com/simonrharris/pyjar),and extracted the patristic pairwise SNP distance between TPA genomes using the `cophenetic.phylo`function in ape v5.5 to infer pairwise SNP distances between TPA genomes.We constructed single-linkage 'edge list' networks 26 of closely related genomes using the network 27 v1.17.1 package in R, after first filtering all potential sample linkages (edges) to remove those not meeting defined thresholds of pairwise SNP distance (zero SNPs, or less than or equal to two SNPs, dependent on analysis), pairwise temporal distance, and/or sharing the same geographical region.Extraction of network component information was performed using iGraph 28 v1.2.9.Networks were plotted in R using the ggnetwork 29 v0.5.10 and ggplot2 30 v3.3.5 packages.Macrolide resistance alleles were inferred using the competitive mapping approach previously described 1 (available at https://github.com/matbeale/Lihir_Treponema_2020/competitive_mapping_Treponema23Smod.sh). We Ethical approval for all clinical samples was granted by the London School of Hygiene and Tropical Medicine Observational Research Ethics Committee (REF#16014) and the National Health Service (UK) Health Research Authority and Health and Care Research Wales (UK; 19/HRA/0112).Diagnostic samples meeting selection criteria were identified at UKHSA using internal laboratory information systems (LIMS).Samples had no patient identifiable information other than an internal sample number, and were transferred to the WSI for testing and whole genome sequencing, initially linked to sample collection year only.UKHSA has permission to process confidential patient data under Regulation 3 (Control of Patient Information) of the Health Service Regulations 2002.Information governance advice and ethics approval for this study were granted by the PHE (now UKHSA) Research

6 .
Zero SNP distance clusters show heterosexual transmission networks as well as mixed networks.A -Delineation of genomes into clusters separated by zero-pairwise SNPs, indicating sublineage and gender orientation of patients.B -Proportion of each cluster by UKHSA Region.

Figure 3. UK genomes are distributed throughout the global phylogeny. Maximum
downloaded publicly available map data with Public Health England Region boundaries from the UK Office for National Statistics (https://geoportal.statistics.gov.uk), and used the rgdal 31 v1.5-27, broom 32 v0.7.10 and ggplot2 packages to process and plot map data, and the scatterpie 33 v0.1.7 package to plot pie charts.To test the expected number of sublineage 14 samples in the North of likelihood phylogeny of 526 genomes from 22 countries, highlighting UK genomes.Tip points are coloured by phylogenetic sublineage(1-17, Singleton), and colour tracks indicate sample continent, whether a sample was collected in the UK, Lineage and Sublineage.
for 22.3% of 17 samples using the `rpois` and `dpois` commands in R.SupplementaryFigure 2. Distribution of all UK genome samples, according to geographical region.Plot shows collection years (A) and total sample counts (B) per UKHSA Region, then proportion of each group by Gender Orientation (C), HIV Status (D) and Age Group (E).Numbers in bar plots indicate exact sample counts.

Supplementary Figure 7. Bayesian time-scaled maximum credibility phylogeny of 526 globally sampled TPA whole genomes.
Colour tracks indicate Sample Collection Year, Country, UK or Other, and UKHSA Region of England.

Supplementary Figure 8. Bayesian phylogenomic and spatiotemporal analysis of major sublineages in England, using UK and Global samples.
Maximum credibility subtree phylogenies derived from a total dataset analysis.A-Sublineage 1, B-Sublineage 2, C-Sublineage 8, D-Sublineage 14. Node points are shaded according to posterior support (black ≥96%, dark grey >91%, light grey >80%).Pink bars on nodes indicate 95% highest posterior density intervals.Colour tracks indicate Sample Collection Year, Country, UK or Other, and UKHSA Region of England.